{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d27f1082-cd10-405e-9570-6f0e934bba8b",
   "metadata": {},
   "source": [
    "# LlamaParse `JobResult` Tour\n",
    "\n",
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/parse/demo_json.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\n",
    "The `JobResult` object is the main object returned by the LlamaParse API. It contains all the information about the job, including the parsed data, metadata, and any errors.\n",
    "\n",
    "This notebook walks through each component of the `JobResult` object and shows you what it contains.\n",
    "\n",
    "Status:\n",
    "| Last Executed | Version | State      |\n",
    "|---------------|---------|------------|\n",
    "| Aug-19-2025   | 0.6.61  | Maintained |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a004db48-8d3f-421c-915a-477692f71b90",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Let's bring in our imports and set up our API keys."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc6a7a4b-b568-4db5-bcba-62f5c517ff3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-cloud-services"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0879301c-ff91-4431-941a-6c0ef7cd8fe2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# API access to llama-cloud\n",
    "os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-..\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b411d2ee-3e6b-45b0-b532-4a8e3abcdea0",
   "metadata": {},
   "source": [
    "## Load Data\n",
    "\n",
    "Let's load a large and complex PDF, San Francisco's 2023 proposed budget."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c39d408f-e885-4940-85c7-b09ca3bc7cb7",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget 'https://www.dropbox.com/scl/fi/vip161t63s56vd94neqlt/2023-CSF_Proposed_Budget_Book_June_2023_Master_Web.pdf?rlkey=hemoce3w1jsuf6s2bz87g549i&dl=0' -O './san_francisco_budget_2023.pdf'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2f42af8-afb3-4b3b-82d3-6b332fb38aa4",
   "metadata": {},
   "source": [
    "## Using LlamaParse for Basic PDF Parsing\n",
    "\n",
    "Let's parse our document!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c9cd670-8229-4ad6-99a9-845bd82b7ec1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Started parsing the file under job_id d12d419a-52fc-400c-9f88-f61b352d3fb2\n"
     ]
    }
   ],
   "source": [
    "from llama_cloud_services import LlamaParse\n",
    "\n",
    "parser = LlamaParse(\n",
    "    parse_mode=\"parse_page_with_agent\",\n",
    "    model=\"openai-gpt-4-1-mini\",\n",
    "    high_res_ocr=True,\n",
    "    adaptive_long_table=True,\n",
    "    outlined_table_extraction=True,\n",
    "    output_tables_as_HTML=True,\n",
    ")\n",
    "result = await parser.aparse(\"./san_francisco_budget_2023.pdf\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11c22bab",
   "metadata": {},
   "source": [
    "Every job will come back with some metadata about the job:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c588c578",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "JobMetadata(job_credits_usage=0, job_pages=0, job_auto_mode_triggered_pages=0, job_is_cache_hit=True)"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result.job_metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e96b7c9",
   "metadata": {},
   "source": [
    "Since this was a re-run, I can see that a cache hit occurred. Jobs are cached for 48 hours by default."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6543d2c6",
   "metadata": {},
   "source": [
    "Beyond this, we can explore the parsed data per-page:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af9f3717",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "362\n"
     ]
    }
   ],
   "source": [
    "print(len(result.pages))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8845fac",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dict_keys(['page', 'text', 'md', 'images', 'charts', 'tables', 'layout', 'items', 'status', 'links', 'width', 'height', 'triggeredAutoMode', 'parsingMode', 'structuredData', 'noStructuredContent', 'noTextContent'])\n"
     ]
    }
   ],
   "source": [
    "print(result.pages[0].model_dump().keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6261f5e3",
   "metadata": {},
   "source": [
    "Inside the page object, you can see nearly every detail about the page.\n",
    "\n",
    "Most of these will depend on the settings you used when parsing. Since we used the default settings, we get the text and markdown for each page, as well as a list of all the elements on the page.\n",
    "\n",
    "* `page`: this is simply the page number, starting at 1.\n",
    "* `text`: this is the text of the page, as extracted by the parser.\n",
    "* `images`: this is an array of all the images on the page, including metadata and text OCRed out of the images, as well as a full-page screenshot of the entire page.\n",
    "* `charts`: this is an array of all the charts on the page, including metadata and text OCRed out of the charts, as well as a full-page screenshot of the entire chart.\n",
    "* `layout`: this is an array of all the layout elements on the page, if you are using layout mode.\n",
    "* `items`: This is an array of all the parsed elements on the page, as used to render the markdown, but separated out into their own objects. This is useful if you want to do more processing on the data.\n",
    "* `links`: this is an array of all the links on the page, if you are used `annotate_links=True`\n",
    "* `status`: this is the status of the page, which is usually \"OK\" unless there was an error processing the page.\n",
    "* `width` and `height`: these are the dimensions of the page in pixels.\n",
    "* `parsingMode`: Contains the specific parsing mode that was used for the page.\n",
    "* `triggeredAutoMode`: this indicates whether the page triggered auto mode; see [LlamaParse docs](https://docs.cloud.llamaindex.ai/llamaparse/getting_started) for more details.\n",
    "* `structuredData`/`noStructuredContent`: these are set if you are using structured mode; see [LlamaParse docs](https://docs.cloud.llamaindex.ai/llamaparse/getting_started) for more details.\n",
    "* `noTextContent`: this is true if the page was empty of text.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a4cc901",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                        CITY & COUNTY OF SAN FRANCISCO, CALIFORNIA\n",
      " PROPOSED BUDGET\n",
      "                                                              FISCAL YEARS 2023-2024 & 2024-2025\n",
      "                                                                                LONDON  N. BREED\n",
      "               MAYOR’S OFFICE OF PUBLIC POLICY AND FINANCE\n",
      "         Anna Duning, Director of Mayor’s                                   Fisher Zhu, Fiscal and Policy Analyst\n",
      "         Office of Public Policy and Finance                             Anya Shutovska, Fiscal and Policy Analyst\n",
      "         Sally Ma, Deputy Budget Director\n",
      "Radhika Mehlotra, Senior Fiscal and Policy Analyst                        Jack English, Fiscal and Policy Analyst\n",
      "     Damon Daniels, Fiscal and Policy Analyst                           Xang Hang, Junior Fiscal and Policy Analyst\n",
      "    Matthew Puckett, Fiscal and Policy Analyst                       Tabitha Romero-Bothi, Fiscal and Policy Assistant\n"
     ]
    }
   ],
   "source": [
    "print(result.pages[0].text[:1000])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d5a5bc2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# CITY & COUNTY OF SAN FRANCISCO, CALIFORNIA\n",
      "\n",
      "# PROPOSED BUDGET\n",
      "\n",
      "# FISCAL YEARS 2023-2024 & 2024-2025\n",
      "\n",
      "# LONDON N. BREED\n",
      "\n",
      "# MAYOR’S OFFICE OF PUBLIC POLICY AND FINANCE\n",
      "\n",
      "Anna Duning, Director of Mayor’s Office of Public Policy and Finance\n",
      "\n",
      "Fisher Zhu, Fiscal and Policy Analyst\n",
      "\n",
      "Anya Shutovska, Fiscal and Policy Analyst\n",
      "\n",
      "Sally Ma, Deputy Budget Director\n",
      "\n",
      "Radhika Mehlotra, Senior Fiscal and Policy Analyst\n",
      "\n",
      "Jack English, Fiscal and Policy Analyst\n",
      "\n",
      "Damon Daniels, Fiscal and Policy Analyst\n",
      "\n",
      "Xang Hang, Junior Fiscal and Policy Analyst\n",
      "\n",
      "Matthew Puckett, Fiscal and Policy Analyst\n",
      "\n",
      "Tabitha Romero-Bothi, Fiscal and Policy Assistant\n"
     ]
    }
   ],
   "source": [
    "print(result.pages[0].md[:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32de4c62",
   "metadata": {},
   "source": [
    "## Images\n",
    "\n",
    "By default, images embedded in documents that can be extracted are part of the result object."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "802d4a98",
   "metadata": {},
   "source": [
    "We can also specify to take screenshots of every page:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ee78f2f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Started parsing the file under job_id e6332422-803b-404d-8d0d-ad510fa56c09\n",
      "..."
     ]
    }
   ],
   "source": [
    "parser = LlamaParse(\n",
    "    parse_mode=\"parse_page_with_agent\",\n",
    "    model=\"openai-gpt-4-1-mini\",\n",
    "    high_res_ocr=True,\n",
    "    adaptive_long_table=True,\n",
    "    outlined_table_extraction=True,\n",
    "    output_tables_as_HTML=True,\n",
    "    # Take screenshot of the page\n",
    "    take_screenshot=True,\n",
    ")\n",
    "result = await parser.aparse(\"./san_francisco_budget_2023.pdf\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fab32886",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ImageItem(name='page_1.jpg', height=792.0, width=612.0, x=0.0, y=0.0, original_width=1236, original_height=1600, type='full_page_screenshot')]\n"
     ]
    }
   ],
   "source": [
    "print(result.pages[0].images)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9eba9e52",
   "metadata": {},
   "source": [
    "We can download images (either their bytes or to a local file) using the `JobResult` object as well!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7aa0a29",
   "metadata": {},
   "outputs": [],
   "source": [
    "# single image\n",
    "image_data = await result.aget_image_data(result.pages[0].images[0].name)\n",
    "\n",
    "# save an image to a file\n",
    "output_path = await result.asave_image(\n",
    "    result.pages[0].images[0].name, \"./json_tour_screenshots\"\n",
    ")\n",
    "\n",
    "# save all images\n",
    "output_paths = await result.asave_all_images(\"./json_tour_screenshots\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eae4ece3",
   "metadata": {},
   "source": [
    "## Items\n",
    "\n",
    "This is an array of all the parsed elements on the page, as used to render the markdown, but separated out into their own objects. This is useful if you want to do more processing on the data. Let's take a look:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c10b9d7d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "type='heading' lvl=1 value='CITY & COUNTY OF SAN FRANCISCO, CALIFORNIA' md='# CITY & COUNTY OF SAN FRANCISCO, CALIFORNIA' rows=None bBox=BBox(x=176.0, y=52.0, w=277.0, h=12.0)\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "print(result.pages[0].items[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dcb9f832",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "type='heading' lvl=1 value='PROPOSED BUDGET' md='# PROPOSED BUDGET' rows=None bBox=BBox(x=89.0, y=118.0, w=451.0, h=47.0)\n"
     ]
    }
   ],
   "source": [
    "print(result.pages[0].items[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7f64443",
   "metadata": {},
   "source": [
    "As you can see you get different element types: text, headings, and tables. Each comes with its own `md` key containing a Markdown representation of that element, allowing you to easily summarize with only headings, tables only, etc..\n",
    "\n",
    "The ability to extract tables from visual data is really powerful. Let's take a look at page 35, which has some bar charts that get automatically converted into tables:\n",
    "\n",
    "<img src=\"./json_tour_screenshots/page_35.png\" alt=\"Page 35\" width=\"300\"/>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4ccee76",
   "metadata": {},
   "source": [
    "The bar chart has been converted into a table, and even though explicit values are not included, the bar chart has been read and approximate values for each bar on the chart have been included!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d6404a5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "type='table' lvl=None value=None md=\"Source: U.S. Census Bureau, 2017-2021 American Community Survey 5-years Estimate.\\n|Race|Educational Level|Number of Residents| | | | |\\n|---|---|---|---|---|---|---|\\n|Age Group| | | | | | |\\n|Under 5 Years|5 to 19 Years|20 to 34 Years|35 to 59 Years|60 and Over| | |\\n|Graduate or professional degree|Bachelor's degree|Associate's degree|Some college, no degree|High school graduate (includes equivalency)|9th to 12th grade, no diploma|Less than 9th grade|\" rows=[[], ['Race', 'Educational Level', 'Number of Residents', '', '', '', ''], ['---', '---', '---', '---', '---', '---', '---'], ['Age Group', '', '', '', '', '', ''], ['Under 5 Years', '5 to 19 Years', '20 to 34 Years', '35 to 59 Years', '60 and Over', '', ''], ['Graduate or professional degree', \"Bachelor's degree\", \"Associate's degree\", 'Some college, no degree', 'High school graduate (includes equivalency)', '9th to 12th grade, no diploma', 'Less than 9th grade']] bBox=BBox(x=68.0, y=129.0, w=613.0, h=3067.0)\n"
     ]
    }
   ],
   "source": [
    "print(result.pages[34].items[6])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9570d3b8",
   "metadata": {},
   "source": [
    "### `links`\n",
    "\n",
    "Our budget PDF doesn't have any links, so let's load a different PDF with links and see what we get.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb0da11a",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget 'https://www.dropbox.com/scl/fi/hay06lyxc49gkuh91oek6/basic-link-1.pdf?rlkey=uije7yb0lxqgqwk7p7hnqepdx&dl=0' -O './basic-link-1.pdf'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7e393e6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Started parsing the file under job_id 9b2df975-af3c-4868-99e2-520ce0b21f4d\n"
     ]
    }
   ],
   "source": [
    "parser = LlamaParse(\n",
    "    parse_mode=\"parse_page_with_agent\",\n",
    "    model=\"openai-gpt-4-1-mini\",\n",
    "    high_res_ocr=True,\n",
    "    adaptive_long_table=True,\n",
    "    outlined_table_extraction=True,\n",
    "    output_tables_as_HTML=True,\n",
    "    # Annotate links in the document\n",
    "    annotate_links=True,\n",
    ")\n",
    "result = await parser.aparse(\"./basic-link-1.pdf\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "701ada4b",
   "metadata": {},
   "source": [
    "This is a very simple document with some internal and external links:\n",
    "\n",
    "<img src=\"./json_tour_screenshots/links_page.png\" alt=\"Page 1\" width=\"300\"/>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e4de7de",
   "metadata": {},
   "source": [
    "The parser finds the external links and their labels and includes them in the `links` section:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29bf7e3c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'url': 'https://www.antennahouse.com/', 'text': 'Antenna House, Inc.'}, {'url': 'https://www.antennahouse.com/', 'text': 'Linking to a website (https://www.antennahouse.com/)'}]\n"
     ]
    }
   ],
   "source": [
    "print(result.pages[0].links)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac9088a2",
   "metadata": {},
   "source": [
    "This concludes our tour! I hope this makes clear the power of JSON mode and the flexibility it gives you over what parts of your documents you can use."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama-parse-aNC435Vv-py3.10",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
